Midterm Project

BCon 147: special topics

Author

Devora Dian L. Palero

Published

October 24, 2024

1 Project overview

In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.

2 Scenario

Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.

Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.

3 Understanding data source

The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.

This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.

## the datatable function from the DT package creates an HTML widget display of the dataset

## install DT package if the package is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |> 
  DT::datatable()

4 Data wrangling and management

Libraries

Task: Load the necessary libraries

Before we start working on the dataset, we need to load the libraries that will be used for data wrangling, analysis, and visualization. Make sure to load the following libraries here. To install a missing package, use the install.packages function. More packages are introduced later in this project, so install them as needed and load them here.

# load all your libraries here
library(tidyverse)
library(readxl)
library(lubridate)
library(tidytext)
library(readr)
library(haven)
library(dplyr)
library(skimr)
library(janitor)
library(ggplot2)
library(magrittr)
library(DT)

4.1 Data importation

Task 4.1. Merging dataset
  • Import the two datasets Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.

  • Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more information about the left_join function here.

  • Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from the DT package.

## import the two data here
employee_dta <- read_csv("D:/dvora/Ekonomista/AY 2024-2025/Special Topic/midterm project/dataset/Employee.csv")

perf_rating_dta <- read_csv("D:/dvora/Ekonomista/AY 2024-2025/Special Topic/midterm project/dataset/PerformanceRating.csv")


## merge employee_dta and perf_rating_dta using left_join function.
merged_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")

## save the merged dataset as hr_perf_dta
hr_perf_dta <- merged_dta

## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)
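
A small sketch of what left_join does when one table has several rows per key may help here. The toy tables below are hypothetical (made-up IDs and ratings, not the project data). They illustrate why the merged data can have more rows than there are employees: later in this report the model output accounts for 6,709 observations plus 190 dropped rows, while the job-role table totals 1,470 employees, which suggests that hr_perf_dta holds one row per performance review rather than one row per employee.

```r
library(dplyr)

# Hypothetical toy tables: the column names mirror the project files,
# but the values are invented for illustration only.
toy_employee <- tibble(
  EmployeeID = c("E1", "E2", "E3"),
  Attrition  = c("No", "Yes", "No")
)
toy_rating <- tibble(
  EmployeeID    = c("E1", "E1", "E2"),
  ManagerRating = c(3, 4, 2)
)

# left_join keeps every row of the left table and repeats an employee
# once per matching rating; E3 has no rating, so its rating is NA.
toy_merged <- left_join(toy_employee, toy_rating, by = "EmployeeID")

print(toy_merged)
nrow(toy_merged)                      # rows after the join
sum(is.na(toy_merged$ManagerRating))  # employees without any rating
```

Because rows are repeated per review, counts and rates computed on the merged table are review-level figures; employee-level figures need the original employee table or a de-duplicated copy.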

4.2 Data management

Task 4.2. Standardizing variable names
  • Using the clean_names function from janitor package, standardize the variable names by using the recommended naming of variables.

  • Save the renamed variables as hr_perf_dta to update the dataset.

## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- hr_perf_dta %>%
  clean_names()


## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Task 4.2. Recode data entries
  • Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.

  • Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.

  • Create new variables cat_work_life_balance, cat_self_rating, cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.

  • Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0, and Yes = 1.

  • Save all the changes in the hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.

## create cat_education
## education is stored as the codes 1-5, so map each code to its label
hr_perf_dta <- hr_perf_dta %>%
  mutate(cat_education = case_when(
    education == 1 ~ "No formal education",
    education == 2 ~ "High school",
    education == 3 ~ "Bachelor",
    education == 4 ~ "Masters",
    education == 5 ~ "Doctorate",
    TRUE ~ NA_character_  # Handle any unexpected values
  ))


## create cat_envi_sat,  cat_job_sat, and cat_relation_sat
## the satisfaction columns are coded 1-5, so map each code to its label
hr_perf_dta <- hr_perf_dta %>%
  mutate(
    cat_envi_sat = case_when(
      environment_satisfaction == 1 ~ "Very dissatisfied",
      environment_satisfaction == 2 ~ "Dissatisfied",
      environment_satisfaction == 3 ~ "Neutral",
      environment_satisfaction == 4 ~ "Satisfied",
      environment_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_  # Handle any unexpected values
    ),
    cat_job_sat = case_when(
      job_satisfaction == 1 ~ "Very dissatisfied",
      job_satisfaction == 2 ~ "Dissatisfied",
      job_satisfaction == 3 ~ "Neutral",
      job_satisfaction == 4 ~ "Satisfied",
      job_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_
    ),
    cat_relation_sat = case_when(
      relationship_satisfaction == 1 ~ "Very dissatisfied",
      relationship_satisfaction == 2 ~ "Dissatisfied",
      relationship_satisfaction == 3 ~ "Neutral",
      relationship_satisfaction == 4 ~ "Satisfied",
      relationship_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_
    ))



## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
## the rating columns are coded 1-5, so map each code to its label
hr_perf_dta <- hr_perf_dta %>%
  mutate(
    cat_work_life_balance = case_when(
      work_life_balance == 1 ~ "Unacceptable",
      work_life_balance == 2 ~ "Needs improvement",
      work_life_balance == 3 ~ "Meets expectation",
      work_life_balance == 4 ~ "Exceeds expectation",
      work_life_balance == 5 ~ "Above and beyond",
      TRUE ~ NA_character_  # Handle any unexpected values
    ),
    cat_self_rating = case_when(
      self_rating == 1 ~ "Unacceptable",
      self_rating == 2 ~ "Needs improvement",
      self_rating == 3 ~ "Meets expectation",
      self_rating == 4 ~ "Exceeds expectation",
      self_rating == 5 ~ "Above and beyond",
      TRUE ~ NA_character_
    ),
    cat_manager_rating = case_when(
      manager_rating == 1 ~ "Unacceptable",
      manager_rating == 2 ~ "Needs improvement",
      manager_rating == 3 ~ "Meets expectation",
      manager_rating == 4 ~ "Exceeds expectation",
      manager_rating == 5 ~ "Above and beyond",
      TRUE ~ NA_character_
    ))



## create bi_attrition
hr_perf_dta <- hr_perf_dta %>%
  mutate(
    bi_attrition = case_when(
      attrition == "Yes" ~ 1,   # Assuming 'attrition' column has "Yes" for employees who left
      attrition == "No" ~ 0,    # Assuming 'attrition' column has "No" for employees still employed
      TRUE ~ NA_real_           # Handle any unexpected values
    ))


## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)
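
As a quick check on the recoding logic, the following minimal sketch applies case_when to a hypothetical toy column. It assumes, as the task text describes, that the rating is stored as the numeric codes 1 to 5; the toy values, including the out-of-range 9, are invented for illustration.

```r
library(dplyr)

# Hypothetical ratings coded 1-5, plus one out-of-range code and one NA
toy <- tibble(job_satisfaction = c(1, 3, 5, 9, NA))

toy <- toy %>%
  mutate(cat_job_sat = case_when(
    job_satisfaction == 1 ~ "Very dissatisfied",
    job_satisfaction == 2 ~ "Dissatisfied",
    job_satisfaction == 3 ~ "Neutral",
    job_satisfaction == 4 ~ "Satisfied",
    job_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_  # anything outside 1-5 (or missing) becomes NA
  ))

# Cross-tabulate codes against labels to confirm the mapping
table(toy$job_satisfaction, toy$cat_job_sat, useNA = "ifany")
```

Running the same kind of cross-tabulation on hr_perf_dta is a cheap way to confirm that no category ended up entirely NA, which would signal a mismatch between the stored values and the conditions.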

5 Exploratory data analysis

5.1 Descriptive statistics of employee attrition

Task 5.1. Breakdown of attrition by key variables
  • Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.

  • Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations in that job role. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.

  • Plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
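
The rate calculation described above can be sketched on a hypothetical five-row table (the roles and outcomes are invented). The sketch uses summarise with .groups = "drop" as one possible alternative to a separate ungroup step.

```r
library(dplyr)

# Hypothetical toy data: two job roles, five employees
toy <- tibble(
  job_role  = c("Recruiter", "Recruiter", "Manager", "Manager", "Manager"),
  attrition = c("Yes", "No", "No", "No", "Yes")
)

toy_rate <- toy %>%
  group_by(job_role) %>%
  summarise(
    total_employees = n(),                      # group size
    total_attrition = sum(attrition == "Yes"),  # leavers in the group
    .groups = "drop"                            # return an ungrouped tibble
  ) %>%
  mutate(pct_attrition = total_attrition / total_employees * 100)

print(toy_rate)
```

The denominator here is the size of each group, so pct_attrition is the share of leavers within a role rather than each role's share of all leavers.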

## selecting attrition key variables and save as attrition_key_var_dta
attrition_key_var_dta <- hr_perf_dta %>%
  select(attrition, job_role, department, age, salary, job_satisfaction, work_life_balance)


## compute the attrition rate across job_role and save as attrition_rate_job_role

# Compute attrition rate by job_role
attrition_rate_job_role <- employee_dta %>%
  group_by(JobRole) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(Attrition == "Yes", na.rm = TRUE)
  ) %>%
  mutate(pct_attrition = total_attrition / total_employees * 100) %>%
  ungroup()
# Ungroup the dataset
# Print the attrition_rate_job_role
datatable(attrition_rate_job_role)
# Attrition Rate by Department
attrition_rate_department <- employee_dta %>%
  group_by(Department) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(Attrition == "Yes", na.rm = TRUE)
  ) %>%
  mutate(pct_attrition = total_attrition / total_employees * 100) %>%
  ungroup()
# Print Attrition Rate Department
datatable(attrition_rate_department)
# Step 3: Compute attrition rate by Age Group
attrition_rate_age <- employee_dta %>%
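  # include.lowest = TRUE keeps age 20 in the first bin; ages outside 20-60 become NA
  mutate(age_group = cut(Age, breaks = c(20, 30, 40, 50, 60), labels = c("20-30", "31-40", "41-50", "51-60"), include.lowest = TRUE)) %>%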
  group_by(age_group) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(Attrition == "Yes", na.rm = TRUE)
  ) %>%
  mutate(pct_attrition = total_attrition / total_employees * 100) %>%
  ungroup()
#Print Attrition Rate Age
datatable(attrition_rate_age)
# Step 4: Compute attrition rate by Salary
# Create salary groups in thousands (e.g., 30K, 50K, etc.)
attrition_key_var_dta <- attrition_key_var_dta %>%
  mutate(salary_group = cut(salary / 1000, 
                            breaks = c(20, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550),  # Specify breaks in thousands
                            right = FALSE,  # Use left-closed intervals
                            labels = c("20K-49K", "50K-99K", "100K-149K", "150K-199K", "200K-249K", "250K-299K", "300K-349K", "350K-399K", "400K-449K", "450K-499K", "500K-549K")))  
# Compute the attrition rate by salary group
attrition_rate_salary <- attrition_key_var_dta %>%
  group_by(salary_group) %>%
  summarize(
    attrition_count = sum(attrition == "Yes", na.rm = TRUE),
    total_count = n(),
    pct_attrition = attrition_count / total_count * 100
  ) %>%
  ungroup()

# Print Attrition Rate Salary
datatable(attrition_rate_salary)
# Step 5: Compute attrition rate by Job Satisfaction
attrition_rate_satisfaction <- hr_perf_dta %>%
  group_by(job_satisfaction) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(attrition == "Yes", na.rm = TRUE)
  ) %>%
  mutate(pct_attrition = total_attrition / total_employees * 100) %>%
  ungroup()
#Print Attrition Rate Satisfaction
datatable(attrition_rate_satisfaction)
#Compute attrition rate by Work Life Balance
attrition_rate_work_life <- hr_perf_dta %>%
  group_by(work_life_balance) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(attrition == "Yes", na.rm = TRUE)
  ) %>%
  mutate(pct_attrition = total_attrition / total_employees * 100) %>%
  ungroup()
#Print Attrition Rate Work Life Balance
datatable(attrition_rate_work_life)
## print attrition_rate_job_role
print(attrition_rate_job_role)
# A tibble: 13 × 4
   JobRole                   total_employees total_attrition pct_attrition
   <chr>                               <int>           <int>         <dbl>
 1 Analytics Manager                      52               3          5.77
 2 Data Scientist                        261              62         23.8 
 3 Engineering Manager                    75               2          2.67
 4 HR Business Partner                     7               0          0   
 5 HR Executive                           28               3         10.7 
 6 HR Manager                              4               0          0   
 7 Machine Learning Engineer             146              10          6.85
 8 Manager                                37               2          5.41
 9 Recruiter                              24               9         37.5 
10 Sales Executive                       327              57         17.4 
11 Sales Representative                   83              33         39.8 
12 Senior Software Engineer              132               9          6.82
13 Software Engineer                     294              47         16.0 
## Plot the attrition rate
# Plot attrition rate by Job Role
ggplot(attrition_rate_job_role, aes(x = JobRole, y = pct_attrition, fill = JobRole)) +
  geom_bar(stat = "identity") +
  labs(title = "Attrition Rate by Job Role", x = "Job Role", y = "Attrition Rate (%)") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  scale_fill_viridis_d()

# Plot attrition rate by Department
ggplot(attrition_rate_department, aes(x = Department, y = pct_attrition, fill = Department)) +
  geom_bar(stat = "identity") +
  labs(title = "Attrition Rate by Department", x = "Department", y = "Attrition Rate (%)") +
  theme_classic() + scale_fill_brewer(palette = "Set2")

# Plot attrition rate by Age Group
ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition, fill = age_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Attrition Rate by Age Group", x = "Age Group", y = "Attrition Rate (%)") +
  theme_classic() + scale_fill_brewer(palette = "Set1")

# Plot attrition rate by Salary
ggplot(attrition_rate_salary, aes(x = salary_group, y = pct_attrition, fill = salary_group)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(pct_attrition, 1), "%")),
            vjust = -0.5, size = 3.5) + 
  labs(title = "Attrition Rate by Salary Group (in Thousands)", 
       x = "Salary Group (in Thousands)", 
       y = "Attrition Rate (%)") +
  theme_classic() +
  theme(legend.position = "right")

# Plot attrition rate by Job Satisfaction
ggplot(attrition_rate_satisfaction, aes(x = job_satisfaction, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "palegreen3", color = "black") +
  labs(title = "Attrition Rate by Job Satisfaction", x = "Job Satisfaction", y = "Attrition Rate (%)") +
  theme_classic()

# Plot attrition rate by Work Life Balance
ggplot(attrition_rate_work_life, aes(x = work_life_balance, y = pct_attrition, fill = work_life_balance)) +
  geom_bar(stat = "identity") +
  labs(title = "Attrition Rate by Work Life Balance", x = "Work Life Balance", y = "Attrition Rate (%)") +
  theme_classic()

5.2 Identifying attrition key drivers using correlation analysis

Task 5.2. Conduct a correlation analysis to identify key drivers
  • Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using the na.omit() before running the correlation analysis. Save the output in hr_corr.

  • Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore this site for more information: ggcorr.

  • Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.
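
To make the na.omit() step concrete, here is a minimal base R sketch on a hypothetical data frame (all values invented). It shows that dropping incomplete rows first and then calling cor() gives the same matrix as asking cor() to use complete observations.

```r
# Hypothetical toy data with two incomplete rows (rows 4 and 5)
toy <- data.frame(
  bi_attrition     = c(1, 0, 0, 1, 0, 1),
  years_at_company = c(1, 8, 6, 2, NA, 1),
  salary           = c(40, 90, 70, NA, 80, 45)
)

toy_clean <- na.omit(toy)    # listwise deletion: keep complete rows only
toy_corr  <- cor(toy_clean)  # Pearson correlation matrix

# Same result without the explicit na.omit() step
toy_corr_alt <- cor(toy, use = "complete.obs")

round(toy_corr, 2)
```

Listwise deletion removes a whole row when any selected variable is missing, so the number of rows used depends on which variables are selected.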

## conduct correlation of key variables. 
key_var_dta <- hr_perf_dta %>%
  select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance)
key_var_dta_clean <- na.omit(key_var_dta)
hr_corr <- cor(key_var_dta_clean)

## print hr_corr 
print(hr_corr)
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
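## install GGally package and use ggcorr function to visualize the correlation
library(GGally)

## hr_corr is already a correlation matrix, so pass it through cor_matrix;
## ggcorr treats its first argument as raw data and would correlate it again
ggcorr(data = NULL, cor_matrix = hr_corr,
       label = TRUE,
       label_round = 2,
       label_size = 3,
       low = "yellow",
       high = "green",
       midpoint = 0) +
  labs(title = "Correlation Matrix of Key HR Variables") +
  theme_classic()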

Discussion:

Provide your discussion here.

  1. Strong Negative Correlation:

    • bi_attrition and years_at_company: The correlation in the printed hr_corr matrix is -0.69, the strongest relationship among the key variables. Attrition is concentrated among employees with short tenure, and the likelihood of leaving falls as years at the company increase. This suggests that the first years of employment are the critical window for retention.
  2. Weak Correlations:

    • bi_attrition and salary: The correlation is -0.21, a weak negative relationship. Higher-paid employees are somewhat less likely to leave, but salary is much less strongly related to attrition than tenure.

    • salary and years_at_company: The correlation is 0.22, a weak positive relationship. Employees who have been with the company longer tend to have higher salaries, which is expected because tenure often leads to raises or promotions. Because the two variables overlap, part of the salary-attrition relationship may reflect tenure.
  3. Near-Zero Correlations with Attrition:

    • job_satisfaction and bi_attrition: The correlation is 0.01, which is essentially zero. In this dataset, job satisfaction scores do not distinguish employees who left from those who stayed.

    • manager_rating and bi_attrition: The correlation is -0.01, and the correlation between work_life_balance and bi_attrition is 0.00. Neither rating shows a linear relationship with attrition.

  4. Minimal to No Correlation:

    • Other relationships, such as between years_at_company and work_life_balance or between manager_rating and job_satisfaction, are also very close to zero (the largest, job_satisfaction with work_life_balance, is 0.04). This indicates that these variables are mostly independent of each other.
  5. Color Interpretation:

    • Yellow tones represent negative correlations.

    • Green tones represent positive correlations.

    • The more saturated the color, the stronger the correlation, whether positive (green) or negative (yellow).

5.3

5.4 Predictive modeling for attrition

Task 5.3. Predictive modeling for attrition
  • Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.

  • Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the documentation here on how to customize your model summary.

  • Also, use the plot_model function to visualize the model coefficients. You may read the documentation here on how to customize your model visualization.

  • Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.

## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction + manager_rating + work_life_balance, 
                              data = hr_perf_dta, 
                              family = binomial)




## print the summary of the model using the summary function
summary(hr_attrition_glm_model)

Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction + 
    manager_rating + work_life_balance, family = binomial, data = hr_perf_dta)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.571e+00  2.173e-01  11.831   <2e-16 ***
salary            -3.633e-06  4.086e-07  -8.893   <2e-16 ***
years_at_company  -6.333e-01  1.476e-02 -42.919   <2e-16 ***
job_satisfaction   3.470e-02  3.186e-02   1.089    0.276    
manager_rating     5.071e-03  3.810e-02   0.133    0.894    
work_life_balance  2.587e-02  3.198e-02   0.809    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8574.5  on 6708  degrees of freedom
Residual deviance: 4781.6  on 6703  degrees of freedom
  (190 observations deleted due to missingness)
AIC: 4793.6

Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model
library(sjPlot)
tab_model(hr_attrition_glm_model)
                     bi attrition
Predictors           Odds Ratios   CI             p
(Intercept)          13.08         8.56 – 20.07   <0.001
salary               1.00          1.00 – 1.00    <0.001
years at company     0.53          0.52 – 0.55    <0.001
job satisfaction     1.04          0.97 – 1.10    0.276
manager rating       1.01          0.93 – 1.08    0.894
work life balance    1.03          0.96 – 1.09    0.419
Observations         6709
R2 Tjur              0.502
## use plot_model function to visualize the model coefficients
plot_model(hr_attrition_glm_model, 
           type = "est",               
           show.values = TRUE,
           show.p = TRUE,               
           value.offset = 0.2,          
           vline.color = "palegreen3", 
           ci.lvl = 0.95,               
           title = "Model Coefficients",
           axis.labels = c("Work-Life Balance", 
                           "Manager Rating", 
                           "Job Satisfaction", 
                           "Years at Company", 
                           "Salary")) + 
  theme_classic()

Discussion:

Provide your discussion here.

  • Odds Ratios: The X-axis of the plot shows the odds ratios for each of the variables in the model. An odds ratio greater than 1 suggests a positive association with employee attrition, while an odds ratio less than 1 indicates a negative association.

  • Variables’ Influence:

    Salary: The odds ratio rounds to 1.00 only because it is expressed per single unit of salary. The coefficient (-3.633e-06) is negative and statistically significant (p < 0.001): a 10,000 increase in salary multiplies the odds of attrition by about 0.96, holding the other variables constant. Higher pay is therefore associated with lower attrition, although the effect per unit is small.

    Years at Company: The odds ratio is about 0.53, meaning each additional year at the company cuts the odds of attrition by roughly 47%. The narrow confidence interval (0.52 to 0.55) and the significance stars (*** for p < 0.001) confirm that it is the strongest predictor in the model.

    Job Satisfaction, Manager Rating, and Work-Life Balance: These variables have odds ratios of 1.04, 1.01, and 1.03, respectively. Their confidence intervals all include 1 and their p-values (0.276, 0.894, and 0.419) are far above 0.05, so the model provides no evidence that they affect attrition.

  • Statistical Significance:

    • Marker color reflects the direction of the estimate rather than its significance. With the default sjPlot colors, odds ratios below 1 (salary and years at company) are drawn in red and odds ratios above 1 are drawn in blue.

    • Significance is shown by the stars next to the values. Salary and years at company are marked “***” (p < 0.001), while job satisfaction, manager rating, and work-life balance carry no stars because their effects are not statistically significant.

  • Confidence Intervals:

    The horizontal lines extending from the points represent the 95% confidence intervals for each estimate. Narrow confidence intervals, such as the one for “Years at Company,” suggest a high degree of precision in the model’s estimate for that variable. The intervals for job satisfaction, manager rating, and work-life balance are wider relative to their estimates and all cross 1, which is why those effects cannot be distinguished from no effect.
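
An odds ratio that prints as 1.00 can still describe a meaningful effect when the predictor is measured in small units. As a worked sketch, the snippet below rescales the salary coefficient reported in the summary() output of this section. The choice of 10,000 and 100,000 as step sizes is only an illustrative assumption.

```r
# Coefficients copied from the summary() output in this section
# (log-odds change per one unit of the predictor)
b_salary <- -3.633e-06
b_years  <- -6.333e-01

# exp(coefficient * step) is the odds ratio for a change of `step` units
or_salary_per_unit <- exp(b_salary)
or_salary_per_10k  <- exp(b_salary * 1e4)
or_salary_per_100k <- exp(b_salary * 1e5)
or_per_year        <- exp(b_years)

round(c(per_unit = or_salary_per_unit,
        per_10k  = or_salary_per_10k,
        per_100k = or_salary_per_100k,
        per_year = or_per_year), 3)
```

An alternative with the same effect is to divide salary by 10,000 before fitting the model, so that the reported odds ratio is already on that scale.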

5.5 Analysis of compensation and turnover

Task 5.4. Analyzing compensation and turnover
  • Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.

  • Install the report package and use the report function to generate a report of the t-test results.

  • Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.

  • Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.

  • Provide recommendations on whether revising compensation policies could be an effective retention strategy.

## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)


## print the results of the t-test
print(attrition_ttest_results)

    Welch Two Sample t-test

data:  salary by bi_attrition
t = 18.869, df = 5524.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 38577.82 47523.18
sample estimates:
mean in group 0 mean in group 1 
      125007.26        81956.76 
tab_model(attrition_ttest_results)
  Dependent variable
Predictors Estimates CI p
salary 43050.50 38577.82 – 47523.18 <0.001
## install the report package and use the report function to generate a report of the t-test results
library(report)
library(broom)
library(knitr)  
library(gtable)     

attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)
attrition_report <- report(attrition_ttest_results)
print(attrition_report)  
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.25e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43050.50, 95% CI [38577.82, 47523.18], t(5524.24) = 18.87, p < .001; Cohen's d
= 0.51, 95% CI [0.45, 0.56])
# This converts the t-test results into a tidy data frame
attrition_ttest_df <- tidy(attrition_ttest_results)
kable(attrition_ttest_df, caption = "T-test Results for Salary Based on Employee Attrition")
T-test Results for Salary Based on Employee Attrition

estimate  estimate1  estimate2  statistic  p.value  parameter  conf.low  conf.high  method                   alternative
43050.5   125007.3   81956.76   18.8692    0        5524.236   38577.82  47523.18   Welch Two Sample t-test  two.sided
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed
library(ggstatsplot)

ggbetweenstats(
  data = hr_perf_dta,               
  x = bi_attrition,                 
  y = salary,                       
  title = "Distribution of Salary for Employees Who Left vs. Those Who Stayed",
  xlab = "Attrition (0 = Stayed, 1 = Left)",
  ylab = "Monthly Salary (in Thousands)",
  plot.type = "boxviolin",          
  centrality.plotting = TRUE,       
  bf.message = FALSE                
) +
  scale_y_continuous(labels = scales::comma_format(scale = 0.001))  # Convert to thousands

# create histogram and frequency polygon of salary for employees who left and those who stayed

# Load necessary library
library(ggplot2)

# Convert bi_attrition to a factor with labels
hr_perf_dta$bi_attrition <- factor(hr_perf_dta$bi_attrition, levels = c(0, 1), labels = c("Stayed", "Left"))

# Create the plot with geom_histogram and geom_freqpoly, faceted by bi_attrition
salary_hist_freqpoly_plot <- ggplot(hr_perf_dta, aes(x = salary, fill = bi_attrition)) + 
  geom_histogram(alpha = 0.5, binwidth = 5000, position = "identity") +  # Histogram with transparency
  geom_freqpoly(aes(color = bi_attrition), binwidth = 5000, linewidth = 1.2) +  # Frequency polygon (linewidth replaces the deprecated size argument for lines)
  facet_wrap(~ bi_attrition) +  # Facet by 'Stayed' and 'Left'
  scale_x_continuous(labels = scales::comma_format(scale = 0.001), name = "Salary (in Thousands)") +  # Salary in thousands
  scale_y_continuous(name = "Count") +  # Count on y-axis (only applied once)
  scale_fill_manual(values = c("Stayed" = "springgreen4", "Left" = "navajowhite2")) +  # Custom colors for histogram
  scale_color_manual(values = c("Stayed" = "springgreen4", "Left" = "navajowhite2")) +  # Custom colors for freqpoly
  labs(title = "Histogram and Frequency Polygon of Salary for Employees Who Left vs. Stayed",
       fill = "Attrition", color = "Attrition") +  # Title and legend labels
  theme_minimal()  # Minimal theme for clean presentation

# Print the plot
print(salary_hist_freqpoly_plot)

Discussion:

Provide your discussion here.

The histogram and frequency polygon illustrate the salary distribution of employees who either stayed with the company or left. For employees who stayed, the distribution is right-skewed, with a large concentration of salaries between 50,000 and 100,000 and a long tail toward higher salaries, where relatively few employees fall. In contrast, the salary distribution of employees who left is concentrated in the lower salary ranges, with a prominent peak near 50,000. The distribution for those who left is more tightly clustered, with fewer employees in the higher salary ranges.

The t-test supports this pattern: the mean salary is 125,007 for employees who stayed and 81,957 for those who left, a difference of 43,050 (95% CI 38,578 to 47,523; p < .001; Cohen's d = 0.51, a medium effect). Employees earning lower salaries are more likely to leave the company, pointing to compensation as a possible driver of attrition. Revising compensation policies, particularly for employees in the lower salary bands, could be an effective strategy to improve retention. By addressing wage disparities and offering better financial incentives to those in lower salary brackets, the company might reduce turnover and improve workforce stability.
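
For readers who want to see how the pieces of a t-test result are accessed, here is a minimal sketch on simulated data. The group sizes, means, and standard deviations are arbitrary assumptions chosen only to resemble the scale of the salaries in this report; they are not estimates from the dataset.

```r
set.seed(2024)  # arbitrary seed so the simulated data is reproducible

# Simulated salaries: 200 "stayed" (0) and 150 "left" (1) employees
toy <- data.frame(
  bi_attrition = rep(c(0, 1), times = c(200, 150)),
  salary = c(rnorm(200, mean = 125000, sd = 60000),
             rnorm(150, mean = 82000,  sd = 45000))
)

# t.test() defaults to the Welch test, which does not assume equal variances
toy_ttest <- t.test(salary ~ bi_attrition, data = toy)

toy_ttest$method     # name of the test that was run
toy_ttest$estimate   # group means (group 0, then group 1)
toy_ttest$conf.int   # 95% CI for mean(group 0) - mean(group 1)
toy_ttest$p.value
```

The sign of the difference follows the group order, so a positive interval means the first group (0, stayed) has the higher mean.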

5.6 Employee satisfaction and performance analysis

Task 5.5. Analyzing employee satisfaction and performance
  • Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.

  • Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.

  • Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.

# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.

avg_ratings <- hr_perf_dta %>%
  group_by(bi_attrition) %>%
  summarise(
    avg_manager_rating = mean(manager_rating, na.rm = TRUE),
    avg_self_rating = mean(self_rating, na.rm = TRUE),
    count_employees = n()
  )

# Print the summary so the group averages are visible
print(avg_ratings)

# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.

ggplot(hr_perf_dta, aes(x = self_rating, fill = as.factor(bi_attrition))) +
  geom_bar(position = "dodge") + 
  labs(
    title = "Distribution of Self-Rating for Employees Who Stayed vs Left",
    x = "Self-Rating",
    y = "Count",
    fill = "Attrition"
  ) +
  theme_classic()

# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.

ggplot(hr_perf_dta, aes(x = manager_rating, fill = as.factor(bi_attrition))) +
  geom_bar(position = "dodge") +  
  labs(
    title = "Distribution of Manager Rating for Employees Who Stayed vs Left",
    x = "Manager Rating",
    y = "Count",
    fill = "Attrition"
  ) +
  theme_classic()

# create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.

ggplot(hr_perf_dta, aes(x = factor(job_satisfaction), y = salary, fill = factor(bi_attrition))) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Job Satisfaction and Attrition Status",
    x = "Job Satisfaction",
    y = "Salary",
    fill = "Attrition"
  ) +
  scale_fill_manual(values = c("chartreuse", "royalblue4")) + 
  theme_classic()

Discussion:

Distribution of Self-Rating for Employees Who Stayed vs Left

The graph shows that at every self-rating level (3, 4, and 5), more employees stayed than left the company. The pattern of attrition is similar across the three levels, which suggests that an employee's self-rating is not a strong predictor of attrition. The largest group of employees who stayed had a self-rating of 3, while the proportion of those who left remains fairly stable across all rating levels.

Distribution of Manager Rating for Employees Who Stayed vs Left

This visualization shows a more varied pattern across rating levels 2 to 5. Most employees received a rating of 3 or 4 from their managers. The distribution is roughly bell-shaped for both employees who stayed and those who left, with fewer employees receiving either a low (2) or a very high (5) rating. The gap between those who stayed and those who left is relatively consistent across all rating levels, though it is slightly larger at the middle ratings (3 and 4).

Salary Distribution by Job Satisfaction and Attrition Status

The plot reveals several patterns. First, employees who stayed (green boxes) generally have a higher salary range than those who left (blue boxes) at every job satisfaction level. Second, there are numerous salary outliers, shown as dots, particularly at the higher satisfaction levels, which indicates that some employees earn well above the median regardless of satisfaction. Finally, even at job satisfaction levels 4 and 5 there is still a noticeable salary gap between those who stayed and those who left, which suggests that compensation may be a key factor in retention regardless of job satisfaction.
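
The bar charts for self-rating and manager rating compare raw counts, so a taller bar partly reflects how many employees received that rating. To judge whether the share of leavers changes with the rating, within-rating proportions are easier to read. The sketch below uses a small hypothetical data frame (`toy` and its values are made up for illustration); only the column names follow the project data.

```r
# Hypothetical toy ratings, NOT the project dataset.
toy <- data.frame(
  self_rating = c(3, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  bi_attrition = factor(
    c("Stayed", "Stayed", "Left", "Stayed", "Stayed",
      "Left", "Stayed", "Stayed", "Stayed", "Left"),
    levels = c("Stayed", "Left")
  )
)

# Cross-tabulate rating by attrition status (rows = rating levels).
counts <- table(toy$self_rating, toy$bi_attrition)

# Convert counts to row-wise proportions so each rating level sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-level shares as stacked bars, which is an alternative to the dodged count bars used in this section.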

5.7 Work-life balance and retention strategies

Task 5.6. Analyzing work-life balance and retention strategies

At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:

  • Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.

  • Use visualizations to show the differences.

  • Assess whether employees with poor work-life balance are more likely to leave.

You are free to choose how to accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.

work_life_balance_summary <- hr_perf_dta %>%
  group_by(bi_attrition, work_life_balance) %>%
  summarise(count = n(), .groups = "drop")

print(work_life_balance_summary)
# A tibble: 11 × 3
   bi_attrition work_life_balance count
   <fct>                    <dbl> <int>
 1 Stayed                       1    84
 2 Stayed                       2  1134
 3 Stayed                       3  1090
 4 Stayed                       4  1146
 5 Stayed                       5   994
 6 Stayed                      NA   190
 7 Left                         1    37
 8 Left                         2   568
 9 Left                         3   580
10 Left                         4   560
11 Left                         5   516

## Use visualizations to show the differences.

# Create the bar plot for WorkLifeBalance
ggplot(hr_perf_dta, aes(x = factor(work_life_balance), fill = factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Distribution of Work-Life Balance for Employees Who Stayed vs Left",
    x = "Work-Life Balance Rating",
    y = "Count",
    fill = "Attrition"
  ) +
  scale_fill_manual(values = c("springgreen4", "navajowhite2")) +
  theme_classic()

## Assess whether employees with poor work-life balance are more likely to leave.

# Compute attrition rate by Work-Life Balance rating
attrition_rate_wlb <- hr_perf_dta %>%
  group_by(work_life_balance) %>%
  summarise(
    total_employees = n(),
    total_attrition = sum(bi_attrition == "Left", na.rm = TRUE),  # Count where attrition is "Left"
    attrition_rate = (total_attrition / total_employees) * 100  # Calculate attrition rate as percentage
  )

# Print the attrition rate summary
print(attrition_rate_wlb)
# A tibble: 6 × 4
  work_life_balance total_employees total_attrition attrition_rate
              <dbl>           <int>           <int>          <dbl>
1                 1             121              37           30.6
2                 2            1702             568           33.4
3                 3            1670             580           34.7
4                 4            1706             560           32.8
5                 5            1510             516           34.2
6                NA             190               0            0

# Visualize the attrition rate by WorkLifeBalance

ggplot(attrition_rate_wlb, aes(x = factor(work_life_balance), y = attrition_rate)) +
  geom_col(fill = "palegreen3", color = "black") +  
  geom_text(aes(label = paste0(round(attrition_rate, 1), "%")),
            vjust = -0.5, size = 3.5) +
  labs(
    title = "Attrition Rate by Work-Life Balance Rating",
    x = "Work-Life Balance Rating",
    y = "Attrition Rate (%)"
  ) +
  theme_classic()
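
The attrition rates printed above differ by only a few percentage points across ratings, so it is worth checking whether the differences go beyond chance variation. A chi-square test of independence is one option. It is an added check here, not part of the original task. The sketch below re-enters the counts printed in `work_life_balance_summary` by hand and leaves out the rows with a missing rating; the object names `wlb_counts` and `wlb_test` are illustrative.

```r
# Counts copied by hand from the work_life_balance_summary output
# (the 190 rows with a missing rating are left out).
wlb_counts <- matrix(
  c(84, 1134, 1090, 1146, 994,   # Stayed, ratings 1 to 5
    37,  568,  580,  560, 516),  # Left, ratings 1 to 5
  nrow = 2, byrow = TRUE,
  dimnames = list(
    bi_attrition = c("Stayed", "Left"),
    work_life_balance = as.character(1:5)
  )
)

# Pearson chi-square test of independence between rating and attrition.
wlb_test <- chisq.test(wlb_counts)
print(wlb_test)

# Attrition rate per rating, recomputed from the counts as a cross-check.
print(round(100 * wlb_counts["Left", ] / colSums(wlb_counts), 1))
```

A large p-value would indicate that the data do not show a clear association between the work-life balance rating and attrition, while a small one would suggest that the rates differ by rating.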

5.8 Recommendations for HR interventions

Task 5.7. Recommendations for HR interventions

Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussions.

  • What are the key factors contributing to employee attrition in the company?

    Answer:

    The analysis points to several factors associated with turnover. Salary stands out: employees in the lower salary brackets, such as “0-50k” and “50k-100k,” show higher attrition rates, which suggests financial dissatisfaction. Work-life balance appears to be a weaker signal in this data, because attrition rates are similar across all five ratings (30.6% to 34.7%). Job satisfaction, which is shaped by salary, work-life balance, and growth opportunities, also plays a role in retention, although the salary gap between employees who stayed and those who left persists at every satisfaction level. Addressing these factors is essential for enhancing employee satisfaction and reducing turnover.

  • Which factors are most strongly correlated with attrition?

    Answer:

    Salary Levels: Higher attrition rates were observed among employees in lower salary ranges. This suggests that employees in lower salary brackets may feel undercompensated, leading to dissatisfaction and eventually to their departure. Organizations may need to evaluate whether the financial rewards they offer are competitive enough to retain their talent, particularly in the lower salary ranges.

    Years at the Company: Employees with fewer years at the company tend to have higher attrition rates. This could imply that newer employees feel less engaged, have unmet expectations, or struggle to adapt to the company culture, leading to early turnover. It may indicate a need for better onboarding processes, stronger support systems, and more tailored retention strategies for new hires.

  • What strategies could be implemented to improve employee retention and satisfaction?

    Answer:

    To reduce attrition, a comprehensive compensation strategy is essential. Start with competitive salary reviews to ensure fair compensation across all levels, focusing on the ranges with the highest attrition rates. Introduce or enhance performance-based incentives, particularly for employees in lower salary brackets, to reward and retain talent. Beyond compensation, consider improving work-life balance: implement flexible work arrangements, such as remote work and adjustable hours, to help employees manage personal and professional demands. In addition, develop wellness programs that offer initiatives like mental health days, yoga sessions, and access to counseling services to promote overall employee well-being.

  • How can HR leverage the insights from the analysis to develop effective retention strategies?

    Answer:

    Leveraging insights from attrition analysis enables more effective retention strategies by focusing on targeted interventions for specific groups, such as employees in lower salary ranges or those struggling with work-life balance. Data-driven policy adjustments can then be made to address the root causes of attrition, ensuring HR initiatives are aligned with employee needs. To remain effective, HR should monitor outcomes continuously and establish a dashboard to track key metrics such as attrition rates and satisfaction scores, so that strategies can be adjusted as new trends emerge.

  • What are the potential benefits of implementing these strategies for the company?

    Answer:

    Implementing these strategies can offer several key benefits. By addressing the factors driving attrition, the company can reduce the turnover costs associated with recruiting and training new employees. Enhancing work-life balance and career development will boost employee engagement and morale, while a competitive salary structure and a focus on employee satisfaction can improve recruitment by attracting high-quality candidates. A stable workforce and lower attrition can enhance organizational performance, leading to higher productivity and improved service delivery. Fostering a positive workplace culture that emphasizes employee well-being and development will strengthen overall satisfaction and loyalty.